Abstract

In this paper, we propose a second order optimization method to learn models
where both the dimensionality of the parameter space and the number of training
samples are high. In our method, we construct on each iteration a Krylov subspace
formed by the gradient and an approximation to the Hessian matrix, and then use
a subset of the training data samples to optimize over this subspace. As with
the Hessian Free (HF) method of [7], the Hessian matrix is never explicitly
constructed, and its products with vectors are computed using a subset of data.
In practice, as in HF, we typically use a positive definite substitute for the
Hessian matrix such as the Gauss-Newton matrix. We investigate the effectiveness
of our proposed method on deep neural networks, and compare its performance to
widely used methods such as stochastic gradient descent, conjugate gradient
descent and L-BFGS, and also to HF. Our method leads to faster convergence than
either L-BFGS or HF, and generally performs better than either of them in
cross-validation accuracy. It is also simpler and more general than HF, as it
does not require a positive semi-definite approximation of the Hessian matrix to
work well nor the setting of a damping parameter. The chief drawback versus HF
is the need for memory to store a basis for the Krylov subspace.



1 Introduction 

Many algorithms in machine learning and other scientific computing fields rely on optimizing a 
function with respect to a parameter space. In many cases, the objective function being optimized 
takes the form of a sum over a large number of terms that can be treated as identically distributed: 
for instance, labeled training samples. Commonly, the problem that we are trying to solve consists 
of minimizing the negated log-likelihood:

$$ f(\theta) = -\log p(Y \mid X; \theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) \qquad (1) $$

where (X, Y) are our observations and labels respectively, and p is the posterior probability of our
labels, which is modeled by a deep neural network with parameters $\theta$. In this case it is possible to use
subsets of the training data to obtain noisy estimates of quantities such as gradients; the canonical
example of this is Stochastic Gradient Descent (SGD).

The simplest reference point to start from when explaining our method is Newton's method with
line search, where on iteration m we do an update of the form:

$$ \theta_{m+1} = \theta_m - \alpha H_m^{-1} g_m, \qquad (2) $$

where $H_m$ and $g_m$ are, respectively, the Hessian and the gradient on iteration m of the objective
function (1); here, the step size $\alpha$ would be chosen to minimize (1) at $\theta_{m+1}$. For high dimensional problems
it is not practical to invert the Hessian; however, we can efficiently approximate (2) using only
multiplication by $H_m$, by using the Conjugate Gradients (CG) method with a truncated number of
iterations. In addition, it is possible to multiply by $H_m$ without explicitly forming it, using what is
known as the "Pearlmutter trick" [11] (although it was known to the optimization community prior
to that; see [10, Chapter 8]) for multiplying an arbitrary vector by the Hessian; this is described
for neural networks but is applicable to quite general types of functions. This type of optimization
method is known as "truncated Newton" or "Hessian-free inexact Newton" [9]. In [3], this method
is applied but using only a subset of data to approximate the Hessian $H_m$. A more sophisticated
version of the same idea was described in the earlier paper [7], in which preconditioning is applied,
the Hessian is damped with the unit matrix in a Levenberg-Marquardt fashion, and the method is
extended to non-convex problems by substituting the Gauss-Newton matrix for the Hessian. We will
discuss the Gauss-Newton matrix and its relationship with the Hessian in Section 2.
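
To make the truncated Newton idea concrete, the following is a minimal NumPy sketch of truncated CG that touches the Hessian only through a matrix-vector product routine. The function name `truncated_cg`, its `damping` argument and the small dense test matrix are illustrative assumptions and are not taken from [7] or [9]; in practice `hess_vec` would be a Pearlmutter-style routine evaluated on a subset of the data.

```python
import numpy as np

def truncated_cg(hess_vec, g, num_iters, damping=0.0):
    """Approximately solve (H + damping * I) d = -g using only products H @ v."""
    d = np.zeros_like(g)
    r = -g                       # residual of the linear system at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(num_iters):
        if rs_old < 1e-20:       # already converged
            break
        Ap = hess_vec(p) + damping * p
        alpha = rs_old / (p @ Ap)
        d = d + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Toy problem (assumption): a small symmetric positive definite "Hessian".
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = A @ A.T + 50 * np.eye(50)
g = rng.standard_normal(50)

d_cg = truncated_cg(lambda v: H @ v, g, num_iters=50)
d_exact = -np.linalg.solve(H, g)
print(np.linalg.norm(d_cg - d_exact))
```

Stopping after only a few iterations still yields a descent direction when the matrix is positive definite, which is what makes truncation usable inside a line search.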

Our method is quite similar to the one described in [7], which we will refer to as Hessian Free (HF).
We also multiply by the Hessian (or Gauss-Newton matrix) using the Pearlmutter trick on a subset
of data, but on each iteration, instead of approximately computing $(H_m + \lambda I)^{-1} g_m$ using truncated
CG, we compute a basis for the Krylov subspace spanned by $g_m, H_m g_m, \ldots, H_m^{K-1} g_m$ for some
K fixed in advance (e.g. K = 20), and numerically optimize the parameter change within this
subspace, using BFGS to minimize the original nonlinear objective function measured on a subset
of the training data. It is easy to show that, for any $\lambda$, the approximate solution to $(H_m + \lambda I)^{-1} g_m$ found by
K iterations of CG will lie in this subspace, so we are in effect automatically choosing the optimal
$\lambda$ in the Levenberg-Marquardt smoothing method of HF (although our algorithm is free to choose a
solution more general than this). We note that both our method and HF use preconditioning, which
we have glossed over in the discussion above. Compared with HF, the advantages of our method
are:

• Greater simplicity and robustness: there is no need for heuristics to initialize and update
the smoothing value $\lambda$.

• Generality: unlike HF, our method can be applied even if H (or whatever approximation or 
substitute we use) is not positive semidefinite. 

• Empirical advantages: our method generally seems to work better than HF in both
optimization speed and classification performance.

The chief disadvantages versus HF are: 

• Memory requirement: we require storage of K times the parameter dimension to store the 
subspace. 

• Convergence properties: the use of a subset of data to optimize over the subspace will 
prevent convergence to an optimum. 

Regarding the convergence properties: we view this as more of a theoretical than a practical problem, 
since for typical setups in training deep networks the residual parameter noise due to the use of data 
subsets would be far less than that due to overtraining. 
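
Returning to the claim made in the comparison with HF, that K iterations of CG (started from zero) give a solution inside the Krylov subspace for any damping value, the following NumPy sketch checks it numerically on a toy problem. The helper names `krylov_basis` and `cg`, the random matrix and the chosen damping values are assumptions made for illustration only.

```python
import numpy as np

def krylov_basis(H, g, K):
    """Orthonormal basis of span{g, Hg, ..., H^(K-1) g} via Gram-Schmidt."""
    basis = []
    v = g.copy()
    for _ in range(K):
        for b in basis:
            v = v - (v @ b) * b
        v = v / np.linalg.norm(v)
        basis.append(v)
        v = H @ v                      # next direction before orthogonalization
    return np.stack(basis, axis=1)     # shape (n, K)

def cg(A, b, K):
    """K iterations of conjugate gradients for A x = b, started from zero."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    for _ in range(K):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

rng = np.random.default_rng(0)
n, K = 30, 5
M = rng.standard_normal((n, n))
H = M @ M.T                            # toy positive semidefinite "Hessian"
g = rng.standard_normal(n)
V = krylov_basis(H, g, K)

outside = {}
for lam in (0.1, 1.0, 10.0):
    x = cg(H + lam * np.eye(n), g, K)
    residual = x - V @ (V.T @ x)       # component of x outside the subspace
    outside[lam] = np.linalg.norm(residual) / np.linalg.norm(x)
print(outside)
```

The relative size of the component outside the subspace should be at the level of rounding error for every damping value, because adding a multiple of the identity to H does not change the Krylov subspace.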

Our motivation for the work presented here is twofold: firstly, we are interested in large-scale
non-convex optimization problems where the parameter dimension and the number of training samples
are large and the Hessian has large condition number. We had previously investigated quite different
approaches based on preconditioned SGD to solve an instance of this type of optimization problem 
(our method could be viewed as an extension to [12]), but after reading [7] our interest switched 
to methods of the HF type. Secondly, we have an interest in deep neural nets, particularly to solve 
problems in speech recognition, and we were intrigued by the suggestion in [7] that the use of 
optimization methods of this type might remove the necessity for pretraining, which would result in 
a welcome simplification. Other recent work on the usefulness of second order methods for deep 
neural networks includes [2, 6]. 

2 The Hessian matrix and the Gauss-Newton matrix 

The Hessian matrix H (that is, the matrix of second derivatives w.r.t. the parameters) can be used
in HF optimization whenever it is guaranteed positive semidefinite, i.e. when minimizing functions
that are convex in the parameters. For non-convex problems, it is possible to substitute a positive
definite approximation to the Hessian. One option is the Fisher information matrix,

$$ F = \sum_i g_i g_i^T, \qquad (3) $$

where indices i correspond to samples and the $g_i$ quantities are the gradients for each sample. This
is a suitable stand-in for the Hessian because it is in a certain sense dimensionally the same, i.e. it
changes the same way under transformations of the parameter space. If the model can be interpreted
as producing a probability or likelihood, it is possible under certain assumptions (including model
correctness) to show that close to convergence, the Fisher and Hessian matrices have the same
expected value. The use of the Fisher matrix in this way is known as Natural Gradient Descent [1];
in [12], a low-rank approximation of the Fisher matrix was used instead. Another alternative that has
less theoretical justification but which seems to work better in practice in the case of neural networks
is the Gauss-Newton matrix, or rather a slight generalization of the Gauss-Newton matrix that we
will now describe.
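
As a small illustration of the Fisher matrix in (3), the sketch below forms F from a matrix of per-sample gradients and also extracts only its diagonal, which is all that a diagonal preconditioner needs. The gradients here are synthetic random numbers and the variable names are hypothetical; no model from the paper is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, dim = 200, 5
# Row i plays the role of the per-sample gradient g_i (synthetic data).
per_sample_grads = rng.standard_normal((num_samples, dim))

# F = sum_i g_i g_i^T, written as a single matrix product.
F = per_sample_grads.T @ per_sample_grads

# Diagonal of F without forming the full matrix.
F_diag = np.sum(per_sample_grads ** 2, axis=0)

print(np.linalg.eigvalsh(F).min())     # nonnegative up to rounding
```

Because F is a sum of outer products it is positive semidefinite by construction, which is the property that makes it usable where the true Hessian is not.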

2.1 The Gauss-Newton matrix 

The Gauss-Newton matrix is defined when we have a function (typically nonlinear) from a vector to
a vector, $f : \mathbb{R}^n \to \mathbb{R}^m$. Let the Jacobian of this function be $J \in \mathbb{R}^{m \times n}$; then the Gauss-Newton
matrix is $G = J^T J$, with $G \in \mathbb{R}^{n \times n}$. If the problem is least-squares on the output of f, then
G can be thought of as one term in the Hessian on the input to f. In its application to neural-network
training, for each training example we consider the network as a nonlinear function from
the neural-network parameters to the output of the network, with the neural-network input treated
as a constant. As in [13], we generalize this from least squares to general convex error functions by
using the expression $J^T H J$, where H is the (positive semidefinite) second derivative of the error
function w.r.t. the neural network output. This can be thought of as the part of the Hessian that
remains after ignoring the nonlinearity of the neural-network in the parameters. In the rest of this
document, following [7] we will refer to this matrix $J^T H J$ simply as the Gauss-Newton matrix, or
G, and depending on the context, we may actually be referring to the summation of this expression
over a number of neural-network training samples.

2.2 Efficiently multiplying by the Gauss-Newton matrix 

As described in [13], it is possible to efficiently multiply a vector by G using a version of the
"Pearlmutter trick"; the algorithm is similar in spirit to backprop and we give it as Algorithm 1. Our
notation and our derivation for this algorithm differ from [11, 13], and we will explain this briefly;
we find our approach easier to follow. The idea is this: we first imagine that we are given a parameter
set $\theta$, and two vectors $\theta_1$ and $\theta_2$ which we interpret as directions in parameter space; we then write
down an algorithm to compute the scalar $s = \theta_2^T G \theta_1$. Assume the neural-network input is given
and fixed; let v be the network output, and write it as $v(\theta)$ to emphasize the dependence on the
parameters, and then let $v_1$ be defined as

$$ v_1 = \lim_{\alpha \to 0} \frac{1}{\alpha} \left( v(\theta + \alpha \theta_1) - v(\theta) \right), \qquad (4) $$

so that $v_1 = J\theta_1$. We define $v_2$ similarly. These can both be computed in a modified forward pass
through the network. Then, if H is the Hessian of the error function in the output of the network
(taken at parameter value $\theta$), s is given by

$$ s = v_2^T H v_1, \qquad (5) $$

since $v_2^T H v_1 = \theta_2^T J^T H J \theta_1 = \theta_2^T G \theta_1$. The Hessian H of the error function would typically
not be constructed as a matrix, but we would compute (5) given some analytic expression for H.
Suppose we have written down the algorithm for computing s (we have not done so here because
of space constraints). Then we treat $\theta_1$ as a fixed quantity, but compute the derivative of s w.r.t. $\theta_2$
(taking $\theta_2$ around zero for convenience). This derivative equals the desired product $G\theta_1$. Taking
the derivative of a scalar w.r.t. the input to an algorithm can be done in a mechanical fashion via
"reverse-mode" automatic differentiation through the algorithm, of which neural-net backprop is a
special case. This is how we obtained Algorithm 1. In the algorithm we denote the derivative of s
w.r.t. a quantity x by $\hat{x}$, i.e. by adding a hat. We note that in this algorithm, we have a "backward
pass" for quantities with subscript 2, which did not appear in the forward pass, because they were
zero (since we take $\theta_2 = 0$) and we optimized them out.

Something to note here is that when the nonlinearity of the last layer is softmax and the error is negated
cross-entropy (equivalently negated log-likelihood, if the label is known), we actually view the
softmax nonlinearity as part of the error function. This is a closer approximation to the Hessian, and it
remains positive semidefinite.

To explain the notation of Algorithm 1: $h^{(l)}$ is the input to the nonlinearity of the l'th layer and $v^{(l)}$ is
the output; $\odot$ means elementwise multiplication; $\phi^{(l)}$ is the nonlinear function of the l'th layer, and
when we apply it to vectors it acts elementwise; $W^{(1)}$ is the neural-network weights for the first layer
(so $h^{(1)} = W^{(1)} v^{(0)}$, and so on); we use the subscript 1 for quantities that represent how quantities
change when we move the parameters in direction $\theta_1$ (as in (4)). The error function is written as
$\mathcal{E}(v^{(L)}, y)$ (where L is the last layer), and y, which may be a discrete value, a scalar or a vector,
represents the supervision information the network is trained with. Typically $\mathcal{E}$ would represent
a squared loss or negated cross-entropy. In the squared-loss case, the quantity $\frac{\partial^2}{\partial v^2} \mathcal{E}(v^{(L)}, y)$ in
Line 10 of Algorithm 1 is just the unit matrix. The other case we deal with here is negated
cross-entropy. As mentioned above, we include the softmax nonlinearity in the error function, treating
the elements of the output layer $v^{(L)}$ as unnormalized log probabilities. If the elements of $v^{(L)}$ are
written as $v_j$ and we let p be the vector of probabilities, with $p_j = \exp(v_j) / \sum_i \exp(v_i)$, then the
matrix of second derivatives is given by

$$ \frac{\partial^2}{\partial v^2} \mathcal{E}(v^{(L)}, y) = \mathrm{diag}(p) - p p^T. \qquad (6) $$



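Algorithm 1 Compute product $\hat{\theta}_2 = G\theta_1$: MultiplyG($\theta$, $\theta_1$, $x$, $y$)

 1: // Note, $\theta = (W^{(1)}, W^{(2)}, \ldots)$ and $\theta_1 = (W_1^{(1)}, W_1^{(2)}, \ldots)$.
 2: $v^{(0)} \leftarrow x$
 3: $v_1^{(0)} \leftarrow 0$
 4: for $l = 1 \ldots L$ do
 5:   $h^{(l)} \leftarrow W^{(l)} v^{(l-1)}$
 6:   $h_1^{(l)} \leftarrow W^{(l)} v_1^{(l-1)} + W_1^{(l)} v^{(l-1)}$
 7:   $v^{(l)} \leftarrow \phi^{(l)}(h^{(l)})$
 8:   $v_1^{(l)} \leftarrow \phi'^{(l)}(h^{(l)}) \odot h_1^{(l)}$
 9: end for
10: $\hat{v}_2^{(L)} \leftarrow \frac{\partial^2}{\partial v^2} \mathcal{E}(v^{(L)}, y) \, v_1^{(L)}$
11: for $l = L \ldots 1$ do
12:   $\hat{h}_2^{(l)} \leftarrow \hat{v}_2^{(l)} \odot \phi'^{(l)}(h^{(l)})$
13:   $\hat{v}_2^{(l-1)} \leftarrow W^{(l)T} \hat{h}_2^{(l)}$
14:   $\hat{W}_2^{(l)} \leftarrow \hat{h}_2^{(l)} v^{(l-1)T}$
15: end for
16: return $\hat{\theta}_2 = (\hat{W}_2^{(1)}, \ldots, \hat{W}_2^{(L)})$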
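
Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm rather than the authors' implementation: it assumes logistic hidden layers, a linear last layer with the softmax folded into the error function as described above, no bias terms, and a single training sample. The label y is omitted because the softmax cross-entropy Hessian in (6) does not depend on it. All function names are hypothetical, and the result is checked against an explicit $J^T H J$ built from a finite-difference Jacobian.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def forward(weights, x):
    """Network output v^(L): logistic hidden layers, linear last layer."""
    v = x
    for l, W in enumerate(weights):
        h = W @ v
        v = h if l == len(weights) - 1 else sigmoid(h)
    return v

def multiply_g(weights, weights1, x):
    """Return [W_hat_2^(1), ..., W_hat_2^(L)], i.e. G theta_1 for one sample."""
    L = len(weights)
    v, v1 = x, np.zeros_like(x)
    inputs, dphis = [], []
    for l in range(L):                            # forward pass, lines 4-9
        inputs.append(v)                          # v^(l-1)
        h = weights[l] @ v
        h1 = weights[l] @ v1 + weights1[l] @ v
        if l == L - 1:
            v, dphi = h, np.ones_like(h)          # linear output layer
        else:
            v = sigmoid(h)
            dphi = v * (1.0 - v)
        v1 = dphi * h1
        dphis.append(dphi)
    p = softmax(v)
    v2_hat = p * v1 - p * (p @ v1)                # line 10 using eq. (6)
    out = [None] * L
    for l in reversed(range(L)):                  # backward pass, lines 11-15
        h2_hat = v2_hat * dphis[l]
        v2_hat = weights[l].T @ h2_hat
        out[l] = np.outer(h2_hat, inputs[l])
    return out

def flatten(ws):
    return np.concatenate([W.ravel() for W in ws])

def unflatten(theta, shapes):
    out, i = [], 0
    for s in shapes:
        size = s[0] * s[1]
        out.append(theta[i:i + size].reshape(s))
        i += size
    return out

rng = np.random.default_rng(0)
shapes = [(4, 3), (5, 4), (3, 5)]
weights = [rng.standard_normal(s) for s in shapes]
weights1 = [rng.standard_normal(s) for s in shapes]
x = rng.standard_normal(3)

# Explicit check: central finite-difference Jacobian, then J^T H J theta_1.
theta, eps = flatten(weights), 1e-6
J = np.stack([(forward(unflatten(theta + eps * e, shapes), x)
               - forward(unflatten(theta - eps * e, shapes), x)) / (2 * eps)
              for e in np.eye(theta.size)], axis=1)
p = softmax(forward(weights, x))
g_theta1_explicit = J.T @ (np.diag(p) - np.outer(p, p)) @ J @ flatten(weights1)
g_theta1_alg = flatten(multiply_g(weights, weights1, x))
print(np.max(np.abs(g_theta1_alg - g_theta1_explicit)))
```

The explicit check costs one pair of forward passes per parameter and is only feasible for tiny networks; the product routine itself costs roughly one forward and one backward pass.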



3 Krylov Subspace Descent: overview 

Now we describe our method, and how it relates to Hessian Free (HF) optimization. The discussion 
in the previous section (on the Hessian versus Gauss-Newton matrix) is orthogonal to the distinction 
between KSD and HF, because either method can use any Hessian substitute, with the proviso that 
our method can use the Hessian even when it is not positive definite. 

In the rest of this section we will use H to refer to either the Hessian or a substitute such as G or F.
In [7] and the work we describe here, these matrices are approximated using a subset of data samples.
In both HF and KSD, the whole computation is preconditioned using the diagonal of F (since this is
easy to compute); however, in the discussion below we will gloss over this preconditioning. In HF,
on each iteration the CG algorithm is used to approximately compute

$$ d = -(H + \lambda I)^{-1} g, \qquad (7) $$

where d is the step direction, and g is the gradient. The step size is determined by a backtracking line
search. The value of $\lambda$ is kept updated by Levenberg-Marquardt style heuristics. Other heuristics
are used to control the stopping of the CG iterations. In addition, the CG iterations for optimizing
d are not initialized from zero (which would be the natural choice) but from the previous value of
d; this loses some convergence guarantees but seems to improve performance, perhaps by adding a
kind of momentum to the updates.

In our method (again glossing over preconditioning), we compute a basis for the subspace spanned
by $\{g, Hg, \ldots, H^{K-1} g, d_{\mathrm{prev}}\}$, which is the Krylov subspace of dimension K, augmented with the
previous search direction. We then optimize the objective function over this subspace using BFGS,
approximating the objective function using a subset of samples.

4 Krylov Subspace Descent in detail 

In this section we describe the details of the KSD algorithm, including the preconditioning. 

For notation purposes: on iteration n of the overall optimization we will write the training data set
used to obtain the gradient as $\mathcal{A}_n$ (which is always the entire dataset in our experiments); the set used
to compute the Hessian or Hessian substitute as $\mathcal{B}_n$; and the set used for BFGS optimization over
the subspace, as $\mathcal{C}_n$. For clarity when dealing with multiple subset sizes, we will typically normalize
all quantities by the number of samples: that is, objective function values, gradients, Hessians and
the like will always be divided by the number of samples in the set over which they were computed.

On each iteration we will compute a diagonal preconditioning matrix D (we omit the subscript n).
D is expected to be a rough approximation to the Hessian. In our experiments, following [7], we
set D to the diagonal of the Fisher matrix computed over $\mathcal{A}_n$. To precondition, we define a new
variable $\tilde{\theta} = D^{1/2}\theta$, compute the Krylov subspace in terms of this variable, and convert back to the
"canonical" co-ordinates. The result is the subspace spanned by the vectors

$$ \{ (D^{-1} H)^k D^{-1} g, \; 0 \le k < K \}. \qquad (8) $$
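
The step from the preconditioned variable to (8) can be spelled out as follows; this is a short worked derivation added for clarity, writing the preconditioned variable as $\tilde{\theta} = D^{1/2}\theta$ and assuming D is diagonal with positive entries. By the chain rule, the gradient and Hessian (or substitute) in the new variable are

$$ \tilde{g} = D^{-1/2} g, \qquad \tilde{H} = D^{-1/2} H D^{-1/2}. $$

The Krylov vectors in the new variable are $\tilde{H}^k \tilde{g}$, and a direction $\tilde{d}$ in the new variable corresponds to $d = D^{-1/2}\tilde{d}$ in the canonical co-ordinates, so

$$ D^{-1/2} \tilde{H}^k \tilde{g} = D^{-1/2} \left( D^{-1/2} H D^{-1/2} \right)^k D^{-1/2} g = (D^{-1} H)^k D^{-1} g. $$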

We adjoin the previous step direction $d_{\mathrm{prev}}$ to this, and it becomes the subspace we optimize over
with BFGS. The algorithm to compute an orthogonal basis for the subspace, and the Hessian (or
Hessian substitute) within it, is given as Algorithm 2.



Algorithm 2 Construct basis $V = [v_1, \ldots, v_{K+1}]$ for the subspace, and the Hessian (or substitute)
$\bar{H}$ in the co-ordinates of the subspace.

 1: $v_1 \leftarrow D^{-1} g$
 2: $v_1 \leftarrow \frac{1}{\sqrt{v_1^T v_1}} v_1$
 3: for $k = 1 \ldots K+1$ do
 4:   $w \leftarrow H v_k$ // If Gauss-Newton matrix, computed with Algorithm 1.
 5:   if $k < K$ then
 6:     $u \leftarrow D^{-1} w$ // u will be $v_{k+1}$
 7:   else if $k = K$ then
 8:     $u \leftarrow d_{\mathrm{prev}}$ // Previous search direction; use arbitrary nonzero vector if 1st iter
 9:   end if
10:   for $j = 1 \ldots k$ do
11:     $\bar{h}_{k,j} \leftarrow w^T v_j$ // Compute element of reduced-dimension Hessian
12:     $u \leftarrow u - (u^T v_j) v_j$ // Orthogonalize u
13:   end for
14:   if $k \le K$ then
15:     $v_{k+1} \leftarrow \frac{1}{\sqrt{u^T u}} u$ // Normalize length and set next direction.
16:   end if
17: end for
18: // Now set upper triangle of $\bar{H}$ to lower triangle.


Dataset         Train smp.  Test smp.  Input       Output      Model                Task
CURVES          20K         10K        784 (bin.)  784 (bin.)  400-200-100-50-25-5  AE
MNIST_AE        60K         10K        784 (bin.)  784 (bin.)  1000-500-250-30      AE
MNIST_CL        60K         10K        784 (bin.)  10 (class)  500-500-2000         Class
MNIST_CL,PT^1   60K         10K        784 (bin.)  10 (class)  500-500-2000         Class
Aurora          1.2M        100K^2     352 (real)  56 (class)  512-1024-1536        Class
Starcraft       900         100        5077 (mix)  8 (class)   10                   Class


Table 1: Datasets and models used in our setup. 
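
As a rough illustration of Algorithm 2, the following NumPy sketch builds the orthonormal basis and the reduced Hessian for a toy quadratic. The function name `build_subspace`, the 0-based indexing, the dense toy matrix and the choice of the diagonal of H as the preconditioner are assumptions made for this example only; in the method itself D is the floored diagonal of the Fisher matrix and the products with H come from Algorithm 1.

```python
import numpy as np

def build_subspace(hess_vec, g, d_prev, D_diag, K):
    """Return V (n x (K+1), orthonormal columns) and H_bar = V^T H V."""
    n = g.size
    V = np.zeros((n, K + 1))
    H_bar = np.zeros((K + 1, K + 1))
    v = g / D_diag
    V[:, 0] = v / np.linalg.norm(v)
    for k in range(K + 1):                    # 0-based k; the paper uses 1..K+1
        w = hess_vec(V[:, k])
        if k < K - 1:
            u = w / D_diag                    # next preconditioned Krylov direction
        elif k == K - 1:
            u = d_prev.copy()                 # adjoin the previous search direction
        for j in range(k + 1):
            H_bar[k, j] = w @ V[:, j]         # element of the reduced Hessian
            if k < K:
                u = u - (u @ V[:, j]) * V[:, j]   # orthogonalize u
        if k < K:
            V[:, k + 1] = u / np.linalg.norm(u)
    # Copy the lower triangle into the upper triangle.
    H_bar = np.tril(H_bar) + np.tril(H_bar, -1).T
    return V, H_bar

rng = np.random.default_rng(0)
n, K = 30, 5
M = rng.standard_normal((n, n))
H = M @ M.T / n + np.eye(n)                   # toy symmetric positive definite matrix
g = rng.standard_normal(n)
d_prev = rng.standard_normal(n)
D_diag = np.diag(H).copy()                    # stand-in diagonal preconditioner

V, H_bar = build_subspace(lambda v: H @ v, g, d_prev, D_diag, K)
print(np.max(np.abs(V.T @ V - np.eye(K + 1))))
print(np.max(np.abs(H_bar - V.T @ H @ V)))
```

Note that only K + 1 products with H are needed, and the reduced Hessian comes for free from the same products used to extend the basis.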



On each iteration of optimization, after computing the basis V with Algorithm 2 we do a further
preconditioning step within the subspace, which gives us a new, non-orthogonal basis $\hat{V}$ for the
subspace. This step is done to help the BFGS converge faster.



Algorithm 3 Krylov Subspace Descent

 1: $d_{\mathrm{prev}} \leftarrow e_1$ // or any arbitrary nonzero vector
 2: for $n = 1, 2, \ldots$ do
 3:   // Sample three sets from training data, $\mathcal{A}_n$, $\mathcal{B}_n$ and $\mathcal{C}_n$.
 4:   $g \leftarrow \frac{1}{|\mathcal{A}_n|} \sum_{i \in \mathcal{A}_n} g_i(\theta)$ // Get average function gradient over this batch.
 5:   Set D to diagonal of Fisher matrix on $\mathcal{A}_n$, floored to $\epsilon$ times its maximum.
 6:   Run Algorithm 2 to find V and $\bar{H}$ on subset $\mathcal{B}_n$
 7:   Let $\hat{H}$ be the result of flooring the eigenvalues of $\bar{H}$ to $\epsilon$ times the maximum.
 8:   Do the Cholesky decomposition $\hat{H} = C C^T$
 9:   Let $\hat{V} = V C^{-T}$ (do this in-place; $C^{-T}$ is upper triangular)
10:   $a \leftarrow 0 \in \mathbb{R}^{K+1}$
11:   Find the optimum $a^*$ with BFGS for about K iterations using the subset $\mathcal{C}_n$, with objective
      function measured at $\theta + \hat{V} a$ and gradient $\hat{V}^T g$ (where g is the gradient w.r.t. $\theta$).
12:   $d_{\mathrm{prev}} \leftarrow \hat{V} a^*$
13:   $\theta \leftarrow \theta + d_{\mathrm{prev}}$
14: end for



The complete algorithm is given as Algorithm 3. The most important parameter is K, the dimension
of the Krylov subspace (e.g. 20). The flooring constant $\epsilon$ is an unimportant parameter; we used
$10^{-4}$. The subset sizes may be important; we recommend that $\mathcal{A}_n$ should be all of the training data,
and $\mathcal{B}_n$ and $\mathcal{C}_n$ should each be about 1/K of the training data, and disjoint from each other but
not from $\mathcal{A}_n$. This is the subset size that keeps the computation approximately balanced between
the gradient computation, subspace construction and subspace optimization. Implementations of the
BFGS algorithm would typically also have parameters: for instance, parameters of the line-search
algorithm and stopping criteria; however, we expect that in practice these would not have too much
effect on performance because the algorithm is likely to converge almost exactly (since the subspace
dimension and the number of iterations are about the same).
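
The preconditioning within the subspace (lines 7 to 9 of Algorithm 3) can be sketched in NumPy as below. The function name `precondition_basis`, the default `eps` value and the random test inputs are assumptions for illustration. The example deliberately uses an indefinite reduced Hessian to show that the eigenvalue flooring makes the Cholesky factorization possible, and it checks that the floored reduced Hessian becomes the identity in the new basis, which is the sense in which this step helps BFGS.

```python
import numpy as np

def precondition_basis(V, H_bar, eps=1e-4):
    """Floor the eigenvalues of H_bar, factor as C C^T, return V_hat = V C^{-T}."""
    evals, evecs = np.linalg.eigh(H_bar)
    evals = np.maximum(evals, eps * np.max(evals))   # flooring (line 7)
    H_floored = (evecs * evals) @ evecs.T
    C = np.linalg.cholesky(H_floored)                # lower triangular (line 8)
    V_hat = np.linalg.solve(C, V.T).T                # equals V @ inv(C).T (line 9)
    return V_hat, H_floored

rng = np.random.default_rng(0)
n, k = 20, 6
V, _ = np.linalg.qr(rng.standard_normal((n, k)))     # orthonormal basis, n x k
S = rng.standard_normal((k, k))
H_bar = (S + S.T) / 2                                # symmetric, generally indefinite

V_hat, H_floored = precondition_basis(V, H_bar)

# Since V has orthonormal columns, V_hat^T V equals inv(C), so this product
# is the floored reduced Hessian expressed in the new basis.
reduced = (V_hat.T @ V) @ H_floored @ (V.T @ V_hat)
print(np.max(np.abs(reduced - np.eye(k))))
```

With the reduced Hessian close to the identity, a quasi-Newton method started from an identity Hessian estimate begins with a good model of the curvature in the subspace.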

5 Experiments 

To evaluate KSD, we performed several experiments to compare it with SGD and with other second 
order optimization methods, namely L-BFGS and HF. We report both training and cross validation 
errors, and running time (we terminated the algorithms with an early stopping rule using held-out 
validation data). Our implementations of both KSD and HF are based on Matlab using Jacket to
perform the expensive matrix operations on a GeForce GTX580 GPU with 1.5GB of memory.

5.1 Datasets and models 

Here we describe the datasets that we used to compare KSD to other methods.

• CURVES: Artificial dataset consisting of curves at 28 x 28 resolution. The dataset consists
of 20K training samples and 10K testing samples. We considered an autoencoder network,
as in [5].

• MNIST: Single digit vision classification task. The digits are 28 x 28 pixels, with 60K
training and 10K testing samples. We considered both an autoencoder network and
classification [5].

• Aurora: Spoken digits dataset, with different levels of real noise (airport, train station, ...).
We used PLP features and performed classification of 56 English phones. These frame
level phone error rates are the ones reported in Table 2. Also reported in the text are
Word Error Rates, which were produced by using the phone posteriors in a Tandem system,
concatenated with standard MFCC to train a Hidden Markov Model with Gaussian Mixture
Model emissions. Further details on the setup can be found in [14].

• Starcraft: The dataset consists of real-time strategy video game sequences from 1000
games. The goal is to predict the strategy the opponent chose based on a fully observed
game sequence after five minutes, and features contain orderings between buildings,
presence/absence features, or times that certain buildings were built.

The models (i.e. network architectures) for each dataset are summarized in Table 1. We tried to 
explore a wide variety of models covering different sizes, input and output characteristics, and tasks. 
Note that the error reported for the autoencoder (AE) task is the L2 norm squared between input
and output, and for the classification (Class) task is the classification error (i.e. 100 - accuracy). The
nonlinearities considered were logistic functions for all the hidden layers except for the "coding"
layer (i.e. middle layer) in the autoencoders, which was linear, and the visible layer for classification,
which was softmax.

5.2 Results and discussion 

Table 2 summarizes our results. We observe that KSD converges faster than HF, and tends to lead to 
lower generalization error. Our implementation for the two methods is almost identical; the steps that 
dominate the computation (computing objective functions, gradients and Hessian or Gauss-Newton 
products) are shared between both and are computed on a GPU. 

For all the experiments we used the Gauss-Newton matrix (unless otherwise specified). The
dimensionality of the Krylov subspace was set to 20, the number of BFGS iterations was set to 30
(although in many cases the optimization on the projected gradients converged before reaching 30),
and an L2 regularization term was added to the objective function. However, motivated by the
observation that on CURVES, HF tends to use a large number of iterations, we experimented with a
larger subspace dimension of K = 80 and these are the numbers we report in Table 2.

For compatibility in memory usage with KSD, we used a moving window of size 10 for the L-BFGS 
methods. We do not report SGD performance in Figures 1 and 2 as it was worse than L-BFGS. 

When using HF or KSD, pre-training helped significantly in the MNIST classification task, but not 
for the other tasks (we do not show the results with pre-training in the other cases; there was no 
significant difference). However, when using SGD or CG for optimization (results not shown),
pre-training helped on all tasks except Starcraft (which is not a deep network). This is consistent with
the notion put forward in [7] that it might be possible to do away with the need for pre-training
if we use powerful second-order optimization methods. The one exception to this, MNIST, has
zero training error when using HF and KSD, which is consistent with a regularization interpretation
of pre-training. This is opposite to the conclusions reached in [4] (their conclusion was that
pre-training helps by finding a better "basin of attraction"), but that paper was not using these types of
optimization methods. Our experiments support the notion that when using advanced second-order 
optimization methods and when overfitting is not a major issue, pre-training is not necessary. We 
are not giving this issue the attention it deserves, since the primary focus of this paper is on our 
optimization method; we may try to support these conclusions more convincingly in future work. 

^1 For MNIST_CL,PT we initialize the weights using pretraining RBMs as in [5]. In the other experiments,
we did not find a significant difference between pretraining and random initialization as in [7].

^2 We report both classification error rate on a 100K CV set, and word error rate on a 5M testing set with
different levels of noise.

                 HF                            KSD
Dataset          Tr. err.  CV err.  Time       Tr. err.  CV err.  Time
CURVES           0.13      0.19     1          0.17      0.25     0.2
MNIST_AE         1.7       2.7      1          1.8       2.5      0.2
MNIST_CL         0%        2.01%    1          0%        1.70%    0.6
MNIST_CL,PT      0%        1.40%    1          0%        1.29%    0.6
Aurora           5.1%      8.7%     1          4.5%      8.1%     0.3
Starcraft        0%        11%      1          0%        5%       0.7


Table 2: Results comparing two second order methods: Hessian Free and Krylov Subspace Descent.
Time reported is relative to the running time of HF (lower than 1 means faster). 



In Figures 1 and 2, we show the convergence of KSD and HF with both the Hessian and
Gauss-Newton matrices. HF eventually "gets stuck" when using the Hessian; the algorithm was not
designed to be used for non-positive definite matrices. Even before getting stuck, it is clear that it does
not work well with the actual Hessian. Our method also works better with the Gauss-Newton matrix
than with the Hessian, although the difference is smaller. Our method is always faster than HF and
L-BFGS.

[Figure 1 plot. Horizontal axis: log10(time (s)). Curves: HF, Hessian matrix; L-BFGS; HF, GN matrix;
KSD, Hessian matrix; KSD, GN matrix.]

Figure 1: Aurora convergence curves for various algorithms.

[Figure 2 plot. Horizontal axis: log10(time (s)). Curves: L-BFGS; HF, Hessian matrix; KSD, GN matrix,
K=80; KSD, Hessian matrix, K=20; HF, GN matrix; KSD, GN matrix, K=20.]

Figure 2: CURVES convergence curves for various algorithms.



6 Conclusion and future work 

In this paper, we proposed a new second order optimization method. Our approach relies on
efficiently computing the matrix-vector product between the Hessian (or a PSD approximation to it)
and a vector. Unlike Hessian Free (HF) optimization, we do not require the approximation of the
Hessian to be PSD, and our method requires fewer heuristics; however, it requires more memory.

Our planned future work in this direction includes investigating the circumstances under which 
pre-training is necessary: that is, we would like to confirm our statement that pre-training is not 
necessary when using sufficiently advanced optimization methods, as long as overfitting is not the 
main issue. Current work shows that the presented method is also able to efficiently train recursive 
neural networks, with no need to use the structural damping of the Gauss-Newton matrix proposed 
in [8].